{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "

Table of Contents
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Archives and DataSets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two main classes provided by the `persist` module: `persist.Archive` and `persist.DataSet`.\n", "\n", "Archives deal with the linkage between objects so that if multiple objects are referred to, they are only stored once in the archive. Archives provide two main methods for to serialize the data:\n", "\n", "1. Via the `str()` operator which will return a string that can be executed to restore the archive.\n", "2. Via the `Archive.save()` method which will export the archive to an importable python package or module.\n", "\n", "DataSets use archives to provide storage for multiple sets of data along with associated metadata. Each set of data is designed to be accessed concurrently using locks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Archive Format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `persist.Archive` object maintains a collection of python objects that are inserted with `persist.Archive.insert()`. This can be serialized to a string and then reconstituted through evaluation:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Usage" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "from builtins import range as _range\n", "_g3 = _range(0, 3)\n", "x = _range(0, 2)\n", "b = [x, _g3, _g3]\n", "a = 1\n", "del _range\n", "del _g3\n", "try: del __builtins__, _arrays\n", "except NameError: pass\n" ] } ], "source": [ "from persist.archive import Archive\n", "\n", "a = 1\n", "x = range(2)\n", "y = range(3) # Implicitly reference in archive\n", "b = [x, y, y] # Nested references to x and y\n", "\n", "# scoped=False is prettier, but slower and not as safe\n", "archive = Archive(scoped=False)\n", "archive.insert(a=a, x=x, b=b)\n", "\n", "# Get the string representation\n", "s = str(archive)\n", "print(s)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'x': range(0, 2), 'b': [range(0, 2), range(0, 3), range(0, 3)], 'a': 1}\n" ] } ], "source": [ "d = {}\n", "exec(s, d)\n", "print(d)\n", "assert d['a'] == a\n", "assert d['x'] == x\n", "assert d['b'] == b\n", "assert d['b'][1] is d['b'][2] # Note: these are the same object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Large Arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have large arrays of data, then it is better to store those externally. To do this, set the `array_threshold` to specify the maximum number of elements to store in an inline array. Any large array will be stored in `Archive.data` and not be included in the string representation. 
To properly reconstitute the archive, this data must be provided in the environment as a dictionary with the key `Archive.data_name`, which defaults to `_arrays`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x = _arrays['array_0']\n", "b = [x, _arrays['array_1']]\n", "a = 1\n", "try: del __builtins__, _arrays\n", "except NameError: pass\n", "{'array_0': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'array_1': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])}\n" ] } ], "source": [ "import os.path, tempfile, shutil, numpy as np\n", "from persist.archive import Archive\n", "\n", "a = 1\n", "x = np.arange(10)\n", "y = np.arange(20) # Implicitly referenced in archive\n", "b = [x, y]\n", "\n", "archive = Archive(scoped=False, array_threshold=5)\n", "archive.insert(a=a, x=x, b=b)\n", "\n", "# Get the string representation\n", "s = str(archive)\n", "print(s)\n", "print(archive.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate the string representation, we need to provide the `_arrays` dictionary:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'x': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'b': [array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])], 'a': 1}\n" ] } ], "source": [ "d = dict(_arrays=archive.data)\n", "exec(s, d)\n", "print(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To store the data, use `Archive.save_data()`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmp5jutcwlt\n", "array_0.npy array_1.npy\r\n" ] } ], "source": [ "import os.path, tempfile\n", "tmpdir = tempfile.mkdtemp() # Make temporary directory for data\n", "datafile = os.path.join(tmpdir, 'arrays')\n", "archive.save_data(datafile=datafile)\n", "print(tmpdir)\n", "!ls $tmpdir/arrays\n", "!rm -rf $tmpdir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importable Archives\n", "\n", "*(New in version 1.0)*\n", "\n", "Archives can be saved as importable packages using the `save()` method. This will write the representable portion of the archive as an importable module with additional code to load any external arrays. Archives can be saved either as a full package (a directory with an `__init__.py` file, etc.) or as a single `.py` module. 
These can be imported without the `persist` package:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[01;34m/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmp9ahg71yq\u001b[00m\r\n", "|-- \u001b[01;34mmod1\u001b[00m\r\n", "| |-- __init__.py\r\n", "| `-- \u001b[01;34m_arrays\u001b[00m\r\n", "| |-- array_0.npy\r\n", "| `-- array_1.npy\r\n", "|-- mod2.py\r\n", "`-- \u001b[01;34mmod2_arrays\u001b[00m\r\n", " |-- array_0.npy\r\n", " `-- array_1.npy\r\n", "\r\n", "3 directories, 6 files\r\n" ] } ], "source": [ "import os.path, sys, tempfile, shutil, numpy as np\n", "from persist.archive import Archive\n", "\n", "tmpdir = tempfile.mkdtemp()\n", "\n", "a = 1\n", "x = np.arange(10)\n", "y = np.arange(20) # Implicitly referenced in archive\n", "b = [x, y]\n", "\n", "archive = Archive(array_threshold=5)\n", "archive.insert(a=a, x=x, b=b)\n", "archive.save(dirname=tmpdir, name='mod1', package=True)\n", "archive.save(dirname=tmpdir, name='mod2', package=False)\n", "\n", "!tree $tmpdir\n", "\n", "sys.path.append(tmpdir)\n", "import mod1, mod2\n", "sys.path.pop()\n", "for mod in [mod1, mod2]:\n", " assert mod.a == a and np.allclose(mod.x, x)\n", "!rm -rf $tmpdir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Archive Details" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Single-item Archives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(New in version 1.0)*\n", "\n", "If an archive contains a single item, then the representation can be simplified so that importing the module will result in the actual object. This is mainly for use in `DataSet`s, where it allows us to have large objects in a module that only get loaded if explicitly imported. In this case, one can also omit the name when calling `Archive.save()`, as it defaults to the name of the single item." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[01;34m/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmpval74y_i\u001b[00m\r\n", "|-- \u001b[01;34mb1\u001b[00m\r\n", "| |-- __init__.py\r\n", "| `-- \u001b[01;34m_arrays\u001b[00m\r\n", "| |-- array_0.npy\r\n", "| `-- array_1.npy\r\n", "|-- b2.py\r\n", "`-- \u001b[01;34mb2_arrays\u001b[00m\r\n", " |-- array_0.npy\r\n", " `-- array_1.npy\r\n", "\r\n", "3 directories, 6 files\r\n" ] } ], "source": [ "import os.path, sys, tempfile, shutil, numpy as np\n", "from persist.archive import Archive\n", "\n", "tmpdir = tempfile.mkdtemp()\n", "\n", "x = np.arange(10)\n", "y = np.arange(20) # Implicitly referenced in archive\n", "b = [x, y]\n", "\n", "archive = Archive(single_item_mode=True, array_threshold=5)\n", "archive.insert(b1=b)\n", "archive.save(dirname=tmpdir, package=True)\n", "\n", "archive = Archive(scoped=False, single_item_mode=True, array_threshold=5)\n", "archive.insert(b2=b)\n", "archive.save(dirname=tmpdir, package=False)\n", "\n", "!tree $tmpdir\n", "\n", "sys.path.append(tmpdir)\n", "import b1, b2\n", "sys.path.pop()\n", "for b_ in [b1, b2]:\n", " assert np.allclose(b_[0], x) and np.allclose(b_[1], y)\n", "!rm -rf $tmpdir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note what is happening here: although we explicitly `import b1`, the result is that `b1 = [x, y]` is the list itself rather than a module. This behaviour is somewhat of an abuse of the import system, so it should not be relied upon too heavily. 
The use in `DataSet` is that these modules are included as submodules of the DataSet package, acting as attributes of the top-level package but only being loaded when explicitly imported, to limit memory usage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Containers, Duplicates, and Circular References\n", "\n", "The main complexity with archives is with objects like lists and dictionaries that refer to other objects: all objects referenced in such \"containers\" need to be stored only once in the archive. A current limitation is that circular dependencies cannot be resolved. The pickling mechanism provides a way to restore circular dependencies, but I do not see an easy way to resolve this in a human-readable format, so the current requirement is that the references in an object form a directed acyclic graph (DAG)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Archive Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we demonstrate a simple archive containing all of the data, starting with the simplest format, which is obtained with `scoped=False`:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Non-scoped (flat) Format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `scoped=False` option produces a flat archive that is easier to read:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 778 µs, sys: 210 µs, total: 988 µs\n", "Wall time: 927 µs\n", "from builtins import range as _range\n", "_g3 = _range(0, 3)\n", "x = _range(0, 2)\n", "b = [x, _g3, _g3]\n", "a = 1\n", "del _range\n", "del _g3\n", "try: del __builtins__, _arrays\n", "except NameError: pass\n" ] } ], "source": [ "import os.path, tempfile, shutil\n", "from persist.archive import Archive\n", "\n", "a = 1\n", "x = range(2)\n", "y = range(3) # Implicitly referenced in archive\n", "b = [x, y, y] # Nested references to x and y\n", "\n", "archive = Archive(scoped=False)\n", "archive.insert(a=a, x=x, b=b)\n", "\n", "# Get the string representation\n", "%time s = str(archive)\n", "print(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that intermediate objects not explicitly inserted are stored in variables like `_g#`, and that these are deleted so that evaluating the string in a dictionary gives a clean result:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'x': range(0, 2), 'b': [range(0, 2), range(0, 3), range(0, 3)], 'a': 1}\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now execute the representation to get the data\n", "d = {}\n", "exec(s, d)\n", "print(d)\n", "d['b'][1] is d['b'][2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The potential problem with the flat format is that, to obtain this simple representation, a graph reduction is performed that replaces intermediate nodes, ensuring that local variables do not have name clashes as well as simplifying the representation. Replacing variables in representations can have performance implications if the objects are large. The fastest approach is a plain string replacement, but this can make mistakes if the substring appears in the data. The option `robust_replace` instead invokes the python AST parser, but this is slower."
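, "\n", "\n", "As a rough sketch of how the two strategies might be compared (this assumes that `robust_replace` is accepted as a keyword argument of `Archive`; check the `persist.archive` documentation for the exact spelling and default):\n", "\n", "```python\n", "from persist.archive import Archive\n", "\n", "a, x = 1, range(2)\n", "b = [x, range(3), x]\n", "\n", "# Assumed keyword: robust_replace=False selects the fast string replacement,\n", "# robust_replace=True the slower but safer AST-based replacement.\n", "fast = Archive(scoped=False, robust_replace=False)\n", "safe = Archive(scoped=False, robust_replace=True)\n", "for arch in (fast, safe):\n", "    arch.insert(a=a, x=x, b=b)\n", "\n", "%time s_fast = str(fast)\n", "%time s_safe = str(safe)\n", "assert s_fast == s_safe  # both should agree when no substring collisions occur\n", "```"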
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scoped Format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To alleviate these issues, the `scoped=True` format is provided. This is visually much more complicated as each object is constructed in a function. The advantage is that this provides a local scope in which objects are defined. As a result, any local variables defined in the representation of the object can be used as they are without worrying that they will conflict with other names in the file. No reduction is performed and no replacements are made, makeing the method faster and more robust, but less attractive if the files need to be inspected by humans:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 412 µs, sys: 22 µs, total: 434 µs\n", "Wall time: 452 µs\n", "\n", "def _g3():\n", " from builtins import range\n", " return range(0, 3)\n", "_g3 = _g3()\n", "\n", "def x():\n", " from builtins import range\n", " return range(0, 2)\n", "x = x()\n", "\n", "def b(_l_0=x,_l_1=_g3,_l_2=_g3):\n", " return [_l_0, _l_1, _l_2]\n", "b = b()\n", "a = 1\n", "del _g3\n", "try: del __builtins__, _arrays\n", "except NameError: pass\n" ] } ], "source": [ "archive = Archive(scoped=True)\n", "archive.insert(a=a, x=x, b=b)\n", "\n", "# Get the string representation\n", "%time s = str(archive)\n", "print(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DataSet Format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the new format of DataSets starting with revision 1.0.\n", "\n", "A DataSet is a directory with the following files:\n", "\n", "* `_this_dir_is_a_DataSet`: This is an empty file signifying that the directory is a DataSet.\n", "\n", "* `__init__.py`: Each DataSet is an importable python module so that the data can be used on a machine without the `persist` package. This file contains the following variable:\n", " * `_info_dict`: This is a dictionary/namespace with string keys (which must be valid python identifiers) and associated data (which should in general be small). These are intended to be interpreted as meta-data.\n", " \n", "For the remainder of this discussion, we shall assume that `_info_dict` contains the key `'x'`.\n", " \n", "* `x.py`: This is the python file responsible for loading the data associated with the key `'x'` in `_info_dict`. If the size of the array is less than the `array_threshold` specified in the `DataSet` object, then the data for the arrays are stored in this file, otherwise this file is responsible for loading the data from an associated file.\n", "\n", "* `x_data.*`: If the size of the array stored in `x` is larger than the `array_threshold`, then the data associated with `x` is stored in this file/directory which may be an HDF5 file, or a numpy array file.\n", "\n", "These DataSet modules can be directly imported. Importing the top-level DataSet will result in a module with the `_info_dict` attribute containing all the meta data. The data items become available when you explicitly import them." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DataSet Examples" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Storing dataset in /var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmplw345fr6\n", "\u001b[01;34m/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmplw345fr6\u001b[00m\n", "`-- \u001b[01;34mdataset\u001b[00m\n", " |-- __init__.py\n", " |-- _this_dir_is_a_DataSet\n", " |-- a.py\n", " |-- x.py\n", " `-- \u001b[01;34mx_data\u001b[00m\n", " `-- array_0.npy\n", "\n", "2 directories, 5 files\n", "A small array\n", "A list with a small and large array\n", "[0 1 2 3 4 5 6 7 8 9]\n", "[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])]\n" ] } ], "source": [ "import os.path, pprint, sys, tempfile, shutil, numpy as np\n", "from persist.archive import DataSet\n", "tmpdir = tempfile.mkdtemp() # Make temporary directory for dataset\n", "print(\"Storing dataset in {}\".format(tmpdir))\n", "\n", "a = np.arange(10)\n", "x = np.arange(15)\n", "\n", "ds = DataSet('dataset', 'w', path=tmpdir, array_threshold=12, data_format='npy')\n", "\n", "ds.a = a\n", "ds.x = [a, x]\n", "ds['a'] = \"A small array\"\n", "ds['x'] = \"A list with a small and large array\"\n", "\n", "!tree $tmpdir\n", "\n", "del ds\n", "\n", "ds = DataSet('dataset', 'r', path=tmpdir)\n", "print(ds['a'])\n", "print(ds['x'])\n", "print(ds.a) # The arrays a and x are not actually loaded until here\n", "print(ds.x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an alternative, you can directly import the dataset without the need for the `persist` library. This also has the feature of delayed loading:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "import dataset: The dataset module initially contains\n", "['__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_info_dict']\n", "import dataset.a, dataset.x: The dataset module now contains\n", "['__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_info_dict', 'a', 'x']\n" ] } ], "source": [ "sys.path.append(tmpdir)\n", "import dataset # Only imports _info_dict at this point\n", "print\n", "print('import dataset: The dataset module initially contains')\n", "print(dir(dataset))\n", "\n", "import dataset.a, dataset.x # Now we get a and x\n", "print\n", "print('import dataset.a, dataset.x: The dataset module now contains')\n", "print(dir(dataset))\n", "\n", "sys.path.pop()\n", "\n", "shutil.rmtree(tmpdir) # Remove files" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda env:work]", "language": "python", "name": "conda-env-work-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "toc": { "base_numbering": 1, "nav_menu": { "height": "66px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", 
"delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }